The analysis data frames are very similar yet yielded very different results.
Recall that the only difference between the two models is the number of demographic variables included as covariates. The number of rows, or observations, is the same, as are the outcome and the other covariates included in the model. We can use the all_equal() function of the dplyr package to compare the columns of our Donohue-like data and our Lott and Mustard-like data.
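As a minimal illustration of this kind of check (the toy tibbles here are hypothetical, not the analysis data), all_equal() returns TRUE for matching data frames and describes the differences otherwise:

```r
library(dplyr)
library(tibble)

a <- tibble(id = 1:3, x = c(0.2, 0.5, 0.9))
b <- tibble(id = 1:3, x = c(0.2, 0.5, 0.9), x_grouped = c("low", "mid", "high"))

all_equal(a, a)  # TRUE: the data frames match
all_equal(a, b)  # instead of TRUE, describes the column that differs
```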
The only difference between the two data frames rests in how the demographic variables were parameterized.
Clearly, this had an effect on the results of the analysis.
Let’s explore how this occurred.
When seemingly independent variables are highly related to one another, the relationships estimated in an analysis may be distorted.
In regression analysis, this distortion arises because the model assumes that the regressors are not linearly dependent on one another. If large enough, the distortion can impact statistical inference.
The phenomenon called multicollinearity occurs when independent variables are highly related to one another.
There are several ways we can diagnose multicollinearity.
Correlation
One way we can evaluate the relationships between variables is by examining the correlation between variable pairs.
It is important to note that multicollinearity and correlation are not one and the same. Correlation can be thought of as the strength of a linear relationship between variables. On the other hand, collinearity can be thought of as two independent variables having a linear relationship or association. Multicollinearity can be thought of as collinearity among multiple (3+) regressors (independent variables) in a regression analysis, which can occur when regressors are highly correlated.
According to Wikipedia:
multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy.
Thus collinearity describes linear prediction or association between variables. Often those variables will be highly correlated.
The issue in linear regression is that regression aims to determine how a one-unit change in a regressor influences the dependent variable, holding the other regressors constant. In fact, this is what the coefficient estimate tells us for each regressor.
However, if our regressors are also linearly related to one another, then it becomes difficult to isolate the effect of each regressor on the dependent variable. Multicollinearity can cause instability in our coefficient estimates, making them unreliable: coefficients may be inflated, deflated, or may even change sign.
If you want to read further on this topic, here and here are a couple of interesting discussions.
In the next sections, we will describe ways to detect multicollinearity in our covariates both using visual displays of data and using computational techniques.
Scatter plots
One of the ways to diagnose multicollinearity in a regression model is to first see if there are regressors that are highly correlated. If so, this suggests that we should investigate further to see if these variables are in fact linearly predicting one another.
One way to look at correlation across pairs of variables is to use the ggpairs() function of the GGally package.
[1] "YEAR" "STATE"
[3] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[5] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[7] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[9] "Unemployment_rate" "Poverty_rate"
[11] "Viol_crime_count" "Population"
[13] "police_per_100k_lag" "RTC_LAW_YEAR"
[15] "RTC_LAW" "TIME_0"
[17] "TIME_INF" "Viol_crime_rate_1k"
[19] "Viol_crime_rate_1k_log" "Population_log"
We can see that for the non-demographic variables, there is very little correlation between the pairs of variables. Only the unemployment rate and the poverty rate show relatively strong correlation, as one might expect.
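A call along these lines would produce the pairs plot described above; d_panel is a hypothetical name for the analysis data frame, and the selected columns are non-demographic covariates from the listing above:

```r
library(dplyr)
library(GGally)

# pairwise scatter plots and correlations for the non-demographic covariates
d_panel %>%
  select(Unemployment_rate, Poverty_rate, Population_log, police_per_100k_lag) %>%
  ggpairs()
```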
Heatmaps
Another way to look at correlation if we have many variables is to show the strength of correlation between pairs of variables using a heatmap, where the intensity of the color indicates the strength of the correlation between two variables.
Let’s do this now for the demographic variables for each analysis.
The ggcorrplot() function of the ggcorrplot package is one way to create such a heatmap.
As input to the plotting function, we first need to calculate the correlation values, which we will do using the cor() function of the stats package.
To label our legend with the Greek letter \(\rho\), we will use the base expression() function, which will convert the written form of "rho" to the Greek letter.
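A sketch of this heatmap, assuming demo_vars is a data frame containing only the demographic columns:

```r
library(ggcorrplot)

# calculate the pairwise correlations with cor() of the stats package
corr_matrix <- cor(demo_vars)

# plot the heatmap, labeling the legend with the Greek letter rho
ggcorrplot(corr_matrix, legend.title = expression(rho))
```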

We can see that many of the demographic variables are highly correlated with one another, either positively or negatively. In this case, the sign does not matter, in terms of the effect the collinearity could have on our modeling results.
The presence of correlation between variables suggests that we might have multicollinearity. However, it does not necessarily mean that we do. So how can we assess this?
Coefficient estimate instability
One way to look at the possible influence of multicollinearity is to look at the stability of the coefficient estimates under perturbations of the data.
We will focus on the RTC_LAW variable coefficient estimate, as this is of particular interest in our case.
To do so we will perform a process called resampling. This involves performing multiple iterations of our analysis, each with only a subset, or sub-sample, of our original data. In our case we will remove one observation at a time and see whether that changes our coefficient estimates.
To do this we will use some functions in the rsample package which is very useful for splitting data in various ways.
We will use the loo_cv() function, which stands for leave-one-out [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)){target="_blank"}. This will allow us to split our data into every possible subset where a unique observation is left out of the data.
This function, however, only prepares the data to be split.
To get the remaining data after the removal of the observation that is left out, we will use a function called training(). These function names arise from the fact that these functions are often used in machine learning applications, where the data is split into a larger training set and a smaller testing set. Thus we want the larger \(n-1\) subset, as opposed to the single observation that is removed (which we could get with the testing() function).
Click here to see an example of how this works.
First we will make a toy dataset that is very simple called test using the tibble() function of the tibble package:
# A tibble: 3 x 1
x
<dbl>
1 1
2 2
3 3
Now we will use the loo_cv() to create leave one out splits:
# Leave-one-out cross-validation
# A tibble: 3 x 2
splits id
<list> <chr>
1 <split [2/1]> Resample1
2 <split [2/1]> Resample2
3 <split [2/1]> Resample3
We can take a look at a single split of the data using the pull() function:
[[1]]
<Analysis/Assess/Total>
<2/1/3>
[[2]]
<Analysis/Assess/Total>
<2/1/3>
[[3]]
<Analysis/Assess/Total>
<2/1/3>
Here you can see that 2 values are intended for the training set (also called Analysis set), 1 value is intended for the testing set (also called Assessment set), and 3 values were present initially.
Now we will use the training() function to get the data without the observation that is set aside. Here is the data for the first subset:
# A tibble: 2 x 1
x
<dbl>
1 1
2 2
Now we will use the map() function of the purrr package to get all possible training subsets of the data.
[[1]]
# A tibble: 2 x 1
x
<dbl>
1 1
2 2
[[2]]
# A tibble: 2 x 1
x
<dbl>
1 2
2 3
[[3]]
# A tibble: 2 x 1
x
<dbl>
1 1
2 3
We can see that all 3 possible subsets that leave one value out are created using this method, which will always create the same number of subsets as there are rows in the data.
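The toy example above can be reproduced with a few lines:

```r
library(tibble)
library(rsample)
library(purrr)

test <- tibble(x = c(1, 2, 3))

test_splits <- loo_cv(test)         # prepare the 3 leave-one-out splits
map(test_splits$splits, training)   # extract the three n - 1 subsets
```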
Now we will use this method with the data from our Donohue-like analysis, and since this data has 1395 rows, 1395 subsets will be created that leave out one row. The idea is to fit our panel regression model on each subset of the data, and then examine how the coefficient estimates from each of these model fits vary as the sample changes slightly. With collinear predictors, we expect that our coefficient estimates may be unstable and subject to change under even small perturbations of the data.
First we will create the splits using the loo_cv() function. Notice that we call the set.seed() function first, a good habit when creating resamples, to ensure reproducibility of the results.
# Leave-one-out cross-validation
# A tibble: 1,395 x 2
splits id
<list> <chr>
1 <split [1.4K/1]> Resample1
2 <split [1.4K/1]> Resample2
3 <split [1.4K/1]> Resample3
4 <split [1.4K/1]> Resample4
5 <split [1.4K/1]> Resample5
6 <split [1.4K/1]> Resample6
7 <split [1.4K/1]> Resample7
8 <split [1.4K/1]> Resample8
9 <split [1.4K/1]> Resample9
10 <split [1.4K/1]> Resample10
# … with 1,385 more rows
Now we will use the training() function to select the remaining data without the value that was removed for each split:
Rows: 1,394
Columns: 20
$ YEAR <fct> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1…
$ STATE <fct> Alaska, Alaska, Alaska, Alaska, Alaska, Ala…
$ Black_Male_15_to_19_years <pseries> 0.1670456, 0.1732299, 0.1737069, 0.1709…
$ Black_Male_20_to_39_years <pseries> 0.9933775, 1.0028219, 1.0204445, 1.0312…
$ Other_Male_15_to_19_years <pseries> 1.1297816, 1.1244412, 1.0698208, 0.9882…
$ Other_Male_20_to_39_years <pseries> 2.963329, 2.974775, 3.015071, 3.008048,…
$ White_Male_15_to_19_years <pseries> 3.627805, 3.558261, 3.391844, 3.222002,…
$ White_Male_20_to_39_years <pseries> 18.28852, 18.12821, 18.10666, 17.90600,…
$ Unemployment_rate <pseries> 9.6, 9.4, 9.9, 9.9, 9.8, 9.7, 10.9, 10.…
$ Poverty_rate <pseries> 9.6, 9.0, 10.6, 12.6, 9.6, 8.7, 11.4, 1…
$ Viol_crime_count <pseries> 1919, 2537, 2732, 2940, 3108, 3031, 304…
$ Population <pseries> 404680, 418519, 449608, 488423, 513697,…
$ police_per_100k_lag <pseries> 194.7218, 200.2299, 191.0553, 364.2335,…
$ RTC_LAW_YEAR <pseries> 1995, 1995, 1995, 1995, 1995, 1995, 199…
$ RTC_LAW <pseries> FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ TIME_0 <pseries> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ TIME_INF <pseries> 2010, 2010, 2010, 2010, 2010, 2010, 201…
$ Viol_crime_rate_1k <pseries> 4.742018, 6.061851, 6.076404, 6.019373,…
$ Viol_crime_rate_1k_log <pseries> 1.556463, 1.802015, 1.804413, 1.794983,…
$ Population_log <pseries> 12.91085, 12.94448, 13.01613, 13.09894,…
[1] 1395
As expected, the first subset has 1394 rows and there are 1395 subsets.
Let’s see what observation was left out in the first subset:
YEAR STATE Black_Male_15_to_19_years Black_Male_20_to_39_years
Texas-1988 1988 Texas 0.5599219 2.091018
Other_Male_15_to_19_years Other_Male_20_to_39_years
Texas-1988 0.09824717 0.4413413
White_Male_15_to_19_years White_Male_20_to_39_years
Texas-1988 3.447339 14.79001
Unemployment_rate Poverty_rate Viol_crime_count Population
Texas-1988 7.3 18 109499 16667146
police_per_100k_lag RTC_LAW_YEAR RTC_LAW TIME_0 TIME_INF
Texas-1988 298.2274 1996 FALSE 1980 2010
Viol_crime_rate_1k Viol_crime_rate_1k_log Population_log
Texas-1988 6.569751 1.882476 16.62895
It looks like the Texas data from 1988 was removed from the first split.
OK, so now let’s fit our panel regression on the first subset of data like we did previously. Note that this causes our data to be an unbalanced panel. This does not require any adjustment to the code to model the data, but you will notice that the output will now say “unbalanced”.
Twoways effects Within Model
Call:
plm(formula = Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years +
White_Male_20_to_39_years + Black_Male_15_to_19_years + Black_Male_20_to_39_years +
Other_Male_15_to_19_years + Other_Male_20_to_39_years + Unemployment_rate +
Poverty_rate + Population_log + police_per_100k_lag, data = DONOHUE_subsets[[1]],
effect = "twoways", model = "within", index = c("STATE",
"YEAR"))
Unbalanced Panel: n = 45, T = 30-31, N = 1394
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.5795659 -0.0893732 -0.0014097 0.0865870 1.1118927
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE 0.01812436 0.01664364 1.0890 0.2763694
White_Male_15_to_19_years -0.00103284 0.02724763 -0.0379 0.9697687
White_Male_20_to_39_years 0.03462260 0.00973037 3.5582 0.0003868 ***
Black_Male_15_to_19_years -0.05699742 0.05747236 -0.9917 0.3215097
Black_Male_20_to_39_years 0.12591876 0.01931901 6.5179 1.016e-10 ***
Other_Male_15_to_19_years 0.69114956 0.11325143 6.1028 1.371e-09 ***
Other_Male_20_to_39_years -0.30242747 0.03812860 -7.9318 4.603e-15 ***
Unemployment_rate -0.01698131 0.00490345 -3.4631 0.0005512 ***
Poverty_rate -0.00782727 0.00295795 -2.6462 0.0082384 **
Population_log -0.17899715 0.06044257 -2.9614 0.0031173 **
police_per_100k_lag 0.00060326 0.00013692 4.4058 1.140e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 43.198
Residual Sum of Squares: 36.701
R-Squared: 0.15039
Adj. R-Squared: 0.095183
F-statistic: 21.0489 on 11 and 1308 DF, p-value: < 2.22e-16
Indeed, we can see that now we have an unbalanced panel with N = 1394 observations instead of 1395, as expected.
Now that we have our subsets, we want to write a function to fit a panel regression using plm() on each subset. See this case study for more information on writing functions.
Now we can apply this function to each of our subsets simultaneously using the map() function of the purrr package.
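Put together, the fitting step might look like the sketch below; fit_DONOHUE is a hypothetical helper name, and tidying each fit with broom's tidy() function is one way to get the coefficient estimates into a data frame:

```r
library(plm)
library(purrr)
library(broom)

# hypothetical helper: fit the Donohue-style panel model on one subset
# and return a tidy tibble of coefficient estimates
fit_DONOHUE <- function(dat) {
  fit <- plm(Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years +
               White_Male_20_to_39_years + Black_Male_15_to_19_years +
               Black_Male_20_to_39_years + Other_Male_15_to_19_years +
               Other_Male_20_to_39_years + Unemployment_rate + Poverty_rate +
               Population_log + police_per_100k_lag,
             data = dat, effect = "twoways", model = "within",
             index = c("STATE", "YEAR"))
  tidy(fit)
}

# fit the model on every leave-one-out subset
subsets_models_DONOHUE <- map(DONOHUE_subsets, fit_DONOHUE)
```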
Great! Now we want to do the same thing for the Mustard and Lott data.
We need to create a different function to fit the data to account for the larger number of demographic variables. We will use the formula that we made previously.
Now we will combine the output so that we can make a plot to visualize the results that we obtained, i.e., to compare how much the coefficient estimates vary across data subsets between the Donohue and the Lott and Mustard models. First let’s name each subset that we created.
Now we can combine the tibbles within the list of tibbles for the subsets_models_DONOHUE and subsets_models_LOTT data.
To do this we will use the bind_rows() function of the dplyr package with the .id = "ID" argument, which will create a new variable called ID that will list the name of the tibble the data came from.
Then we will combine the data from both the Donohue and Lott simulations.
# A tibble: 6 x 7
ID term estimate std.error statistic p.value Analysis
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 DONOHUE_1 RTC_LAWTRUE 0.0181 0.0166 1.09 2.76e- 1 Analysis…
2 DONOHUE_1 White_Male_15_to_19… -0.00103 0.0272 -0.0379 9.70e- 1 Analysis…
3 DONOHUE_1 White_Male_20_to_39… 0.0346 0.00973 3.56 3.87e- 4 Analysis…
4 DONOHUE_1 Black_Male_15_to_19… -0.0570 0.0575 -0.992 3.22e- 1 Analysis…
5 DONOHUE_1 Black_Male_20_to_39… 0.126 0.0193 6.52 1.02e-10 Analysis…
6 DONOHUE_1 Other_Male_15_to_19… 0.691 0.113 6.10 1.37e- 9 Analysis…
# A tibble: 6 x 7
ID term estimate std.error statistic p.value Analysis
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 LOTT_13… Other_Male_50_to_64_… -3.91e+0 0.375 -10.4 1.66e-24 Analysis…
2 LOTT_13… Other_Male_65_years_… -4.16e+0 0.369 -11.3 3.97e-28 Analysis…
3 LOTT_13… Unemployment_rate -5.43e-3 0.00437 -1.24 2.14e- 1 Analysis…
4 LOTT_13… Poverty_rate -5.75e-3 0.00253 -2.27 2.33e- 2 Analysis…
5 LOTT_13… Population_log -2.16e-1 0.0846 -2.55 1.08e- 2 Analysis…
6 LOTT_13… police_per_100k_lag 6.96e-4 0.000133 5.22 2.07e- 7 Analysis…
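The combining step might look like this; the Analysis labels are assumptions based on the facet titles used later:

```r
library(dplyr)

# bind the per-subset tibbles, recording the source tibble name in ID,
# then stack the two analyses together
simulations <- bind_rows(
  bind_rows(subsets_models_DONOHUE, .id = "ID") %>% mutate(Analysis = "Analysis 1"),
  bind_rows(subsets_models_LOTT,    .id = "ID") %>% mutate(Analysis = "Analysis 2")
)
```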
Now we will make a set of parallel boxplots using the geom_boxplot() function of the coefficient estimates of the RTC_LAWTRUE variable for each simulation.
Since there are many variables in both analyses, we will use the facet_grid() function of the ggplot2 package to separate the data for each analysis into subplots. The arguments scale = "free_x" and drop = TRUE allow us to include only the variables that were present in Analysis 1, as opposed to leaving empty spots for the variables that were in Analysis 2 but not in Analysis 1. The space = "free" argument removes the extra space from the dropped variables.
Question Opportunity
What happens if you don’t use the drop = TRUE argument or the space = "free" argument?
simulation_plot <- simulations %>%
ggplot(aes(x = term, y = estimate)) +
geom_boxplot() +
facet_grid(.~Analysis, scale = "free_x", space = "free", drop = TRUE) +
labs(title = "Coefficient estimates",
subtitle = "Estimates across leave-one-out analyses",
x = "Term",
y = "Coefficient",
caption = "Results from simulations") +
theme_linedraw() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 70, hjust = 1),
strip.text.x = element_text(size = 14, face = "bold"))
simulation_plot

Here, we can start to see that there is a bit more variability in the coefficient estimates from the leave-one-out results for Analysis 2.
For display purposes, we would like to order the covariates so that they are displayed similarly across the two panels. This will allow us to better observe how the coefficients of the same covariate behave in the different analyses. We use the mutate() function to convert the term variable to a factor, where we assign the non-demographic variables to be the first levels, with the demographic variables, sorted alphabetically using the base sort() function, coming afterward.
You might notice that the names of the demographic variable values in the term variable of the simulations data all contain the word “years”. We can use the str_subset() function of the stringr package to select just the demographic variables based on the word "years". In contrast, we can use the negate = TRUE argument to select all the other, non-demographic variable values. We can use the base unique() function to grab just the unique values of the term variable.
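A sketch of this extraction, producing the two listings below:

```r
library(stringr)

all_terms <- unique(simulations$term)

# non-demographic terms: names that do not contain "years"
other_terms <- str_subset(all_terms, "years", negate = TRUE)

# demographic terms: names that contain "years"
demo_terms <- str_subset(all_terms, "years")
```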
[1] "RTC_LAWTRUE" "Unemployment_rate" "Poverty_rate"
[4] "Population_log" "police_per_100k_lag"
[1] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[3] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[5] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[7] "White_Female_10_to_19_years" "White_Female_20_to_29_years"
[9] "White_Female_30_to_39_years" "White_Female_40_to_49_years"
[11] "White_Female_50_to_64_years" "White_Female_65_years_and_over"
[13] "White_Male_10_to_19_years" "White_Male_20_to_29_years"
[15] "White_Male_30_to_39_years" "White_Male_40_to_49_years"
[17] "White_Male_50_to_64_years" "White_Male_65_years_and_over"
[19] "Black_Female_10_to_19_years" "Black_Female_20_to_29_years"
[21] "Black_Female_30_to_39_years" "Black_Female_40_to_49_years"
[23] "Black_Female_50_to_64_years" "Black_Female_65_years_and_over"
[25] "Black_Male_10_to_19_years" "Black_Male_20_to_29_years"
[27] "Black_Male_30_to_39_years" "Black_Male_40_to_49_years"
[29] "Black_Male_50_to_64_years" "Black_Male_65_years_and_over"
[31] "Other_Female_10_to_19_years" "Other_Female_20_to_29_years"
[33] "Other_Female_30_to_39_years" "Other_Female_40_to_49_years"
[35] "Other_Female_50_to_64_years" "Other_Female_65_years_and_over"
[37] "Other_Male_10_to_19_years" "Other_Male_20_to_29_years"
[39] "Other_Male_30_to_39_years" "Other_Male_40_to_49_years"
[41] "Other_Male_50_to_64_years" "Other_Male_65_years_and_over"
Now we can create the order of the values of the term variable using the factor() function and its levels argument.
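Assuming other_terms and demo_terms hold the non-demographic and demographic term values extracted above, the releveling might look like:

```r
library(dplyr)

# non-demographic terms first, then the alphabetically sorted demographic terms
simulations <- simulations %>%
  mutate(term = factor(term, levels = c(other_terms, sort(demo_terms))))

levels(simulations$term)
```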
[1] "RTC_LAWTRUE" "Unemployment_rate"
[3] "Poverty_rate" "Population_log"
[5] "police_per_100k_lag" "Black_Female_10_to_19_years"
[7] "Black_Female_20_to_29_years" "Black_Female_30_to_39_years"
[9] "Black_Female_40_to_49_years" "Black_Female_50_to_64_years"
[11] "Black_Female_65_years_and_over" "Black_Male_10_to_19_years"
[13] "Black_Male_15_to_19_years" "Black_Male_20_to_29_years"
[15] "Black_Male_20_to_39_years" "Black_Male_30_to_39_years"
[17] "Black_Male_40_to_49_years" "Black_Male_50_to_64_years"
[19] "Black_Male_65_years_and_over" "Other_Female_10_to_19_years"
[21] "Other_Female_20_to_29_years" "Other_Female_30_to_39_years"
[23] "Other_Female_40_to_49_years" "Other_Female_50_to_64_years"
[25] "Other_Female_65_years_and_over" "Other_Male_10_to_19_years"
[27] "Other_Male_15_to_19_years" "Other_Male_20_to_29_years"
[29] "Other_Male_20_to_39_years" "Other_Male_30_to_39_years"
[31] "Other_Male_40_to_49_years" "Other_Male_50_to_64_years"
[33] "Other_Male_65_years_and_over" "White_Female_10_to_19_years"
[35] "White_Female_20_to_29_years" "White_Female_30_to_39_years"
[37] "White_Female_40_to_49_years" "White_Female_50_to_64_years"
[39] "White_Female_65_years_and_over" "White_Male_10_to_19_years"
[41] "White_Male_15_to_19_years" "White_Male_20_to_29_years"
[43] "White_Male_20_to_39_years" "White_Male_30_to_39_years"
[45] "White_Male_40_to_49_years" "White_Male_50_to_64_years"
[47] "White_Male_65_years_and_over"
Looks good!
Now we just need to run the same code again to create the plot, but now the order of the x axis values will be different.
simulation_plot <- simulations %>%
ggplot(aes(x = term, y = estimate)) +
geom_boxplot() +
facet_grid(.~Analysis, scale = "free_x", space = "free", drop = TRUE) +
labs(title = "Coefficient estimates",
subtitle = "Estimates across leave-one-out analyses",
x = "Term",
y = "Coefficient",
caption = "Results from simulations") +
theme_linedraw() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 70, hjust = 1),
strip.text.x = element_text(size = 14, face = "bold"))
simulation_plot

We can see that the range of coefficient estimates when only one observation is removed is much larger in Analysis 2 for nearly all variables, but particularly for many of the additional demographic variables.
Let’s make a plot showing the summary of the overall coefficient instability.
To do this we will calculate the standard deviation of coefficient estimates for each variable across all of the simulations. Thus we will group by the Analysis and the term variables now that our data is in long format. We will use the sd() function of the stats package to calculate the standard deviation.
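This summary can be sketched as:

```r
library(dplyr)

# one standard deviation of the leave-one-out estimates per analysis and term
coeff_sd <- simulations %>%
  group_by(Analysis, term) %>%
  summarize(SD = sd(estimate), .groups = "drop")
```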
First, we will display these standard deviations in an interactive, searchable table created with the datatable() function of the DT package. Try searching for “RTC” to compare the standard deviations of the coefficients for the RTC_LAWTRUE variable across the two analyses.
Now we will make a plot of this data, including SDs from all coefficients in each model.
simulation_plot <- coeff_sd %>%
ggplot(aes(x = Analysis, y = SD)) +
geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
labs(title = "Coefficient variability",
subtitle = "SDs of coefficient estimates from leave-one-out analysis",
x = "Term",
y = "Coefficient Estimate \n Standard Deviations",
caption = "Results from simulations") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(size = 8, color = "black"),
axis.text.y = element_text(color = "black"))
simulation_plot

Here we can clearly see that overall the coefficient estimates are much less stable in Analysis 2. This is an indication that we may have multicollinearity in our data.
VIF
Another way of evaluating the presence and severity of multicollinearity is to calculate the variance inflation factor (VIF).
According to Wikipedia:
It provides an index that measures how much the variance (the square of the estimate’s standard deviation) of an estimated regression coefficient is increased because of collinearity.
The variance inflation factor (VIF) is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone.
VIF values can be calculated for each explanatory variable in a model by performing the following calculation:
- Run another ordinary least squares (OLS) linear regression with one of the explanatory variables of your model of interest (\(X_i\)) as the dependent variable and the remaining explanatory variables as the regressors.
So generically speaking say this is our model:
\[Y = β_0 + β_1X_1 + β_2X_2 + β_3X_3 + e \]
We have three explanatory variables (\(X_1\), \(X_2\), and \(X_3\)).
If we want to calculate the VIF value for \(X_1\) we would need to perform another OLS model, where \(X_1\) is now the dependent variable explained by the other explanatory variables.
\[X_1 = β_0 + β_2X_2 + β_3X_3 + e\]
The \(R^2\) coefficient of determination (also called R squared value) from this regression is then used to calculate the VIF as follows:
\[\frac{1}{1-R^{2}}\]
The \(R^2\) value is in this case the proportion of variance in \(X_1\) explained by the other variables (\(X_2\) and \(X_3\)).
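A toy numeric illustration of this calculation in R, with made-up data in which \(X_1\) is nearly a linear combination of \(X_2\) and \(X_3\):

```r
set.seed(1)
X2 <- rnorm(100)
X3 <- rnorm(100)
X1 <- 2 * X2 - X3 + rnorm(100, sd = 0.5)  # X1 nearly determined by X2 and X3

# regress X1 on the other explanatory variables and take the R-squared
r_squared <- summary(lm(X1 ~ X2 + X3))$r.squared

# VIF = 1 / (1 - R^2); expect a large value for this construction
1 / (1 - r_squared)
```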
VIF values are typically calculated for each explanatory variable when evaluating multicollinearity of a model.
The calculation for a single variable is: \[VIF_i = \frac{1}{1-R_i^{2}}\] Where \(i\) is the index of each explanatory variable.
Recall that according to Wikipedia:
The variance inflation factor (VIF) is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone.
The \(R^2\) value ranges from 0 to 1, and if the variation of one variable is highly explained by the other variables, the \(R^2\) will approach 1. Thus the denominator in the VIF calculation \(1-R_i^{2}\) (which is sometimes referred to as tolerance) will be smaller and the VIF value will be larger.
Thus, higher VIF values indicate more severe multicollinearity. Typically, a tolerance of less than 0.10 and/or a VIF of 10 or above is used as a rule of thumb to determine whether the presence of multicollinearity might be problematic.
Please see this article for a thorough explanation of how to interpret VIF values and how to decide what to do if your model has high multicollinearity.
So how do we calculate VIF values in R?
We could do this manually by fitting many linear regressions, but that would obviously be time consuming. Luckily, the car package has a function called vif() that will calculate VIF values. However, there is one wrinkle: the vif() function is not compatible with the output of the plm() function. There is, however, a workaround that allows us to fit a similar model using the standard lm() function on data from which we have removed the within-individual means. While this won’t give us exactly the same results in terms of the standard errors of our estimates, it will give us some idea of the VIF values for the covariates in our model. We are following the steps outlined here, a really nice summary of panel data modeling in R.
Once we have calculated our VIF values, we will create nicer looking output of the data using the as_tibble() function of the tibble package to create a tibble and add the variable names as another column.
Recall that we previously created the DONOHUE_OUTPUT object like so:
What we have modeled is how a state’s violent crime rate has changed with modifications of RTC law status and over time, relative to the state’s own average, and how this compares to similar within-state changes in violent crime in the other states.
The coefficients from this model then are, in an oversimplified explanation, centered across all states and across all time points. This is called “demeaned” data.
We will now use this model output to create a data frame of demeaned data (where the effect of time is accounted for as are the within-individuals effects, in this case the different states). We will make a model matrix of this data by using the model.matrix() function of the stats package and then we will create a data frame from this using the base as.data.frame() function.
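These two steps can be sketched as follows, DONOHUE_OUTPUT being the plm fit created earlier:

```r
library(plm)

# model.matrix() on a "within" plm fit returns the demeaned regressors;
# as.data.frame() converts the matrix to a data frame
demeaned_df <- as.data.frame(model.matrix(DONOHUE_OUTPUT))
```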
Rows: 1,395
Columns: 11
$ RTC_LAWTRUE <dbl> -0.1491039, -0.1491039, -0.1491039, -0.1491…
$ White_Male_15_to_19_years <dbl> -0.17167489, -0.06619055, -0.06629357, -0.0…
$ White_Male_20_to_39_years <dbl> 3.2089257, 2.8592259, 2.7093166, 2.4692997,…
$ Black_Male_15_to_19_years <dbl> -0.098689711, -0.076244940, -0.056288623, -…
$ Black_Male_20_to_39_years <dbl> 0.211104389, 0.178202827, 0.160277326, 0.14…
$ Other_Male_15_to_19_years <dbl> 0.090970393, 0.082703650, 0.030284388, -0.0…
$ Other_Male_20_to_39_years <dbl> 0.017436309, 0.005636164, 0.023425966, -0.0…
$ Unemployment_rate <dbl> 0.872831541, 0.203942652, -1.202724014, -1.…
$ Poverty_rate <dbl> -0.2531900, -1.7598566, -1.1331900, 0.61569…
$ Population_log <dbl> -0.223047491, -0.200306695, -0.139763637, -…
$ police_per_100k_lag <dbl> 13.0776479, 16.2876170, 2.6545268, 140.1817…
Notice that this does not contain any outcome data. We will add the outcome by using the output of the Within() function of the plm package to get the violent crime data after accounting for the state-specific effects. According to the documentation for this package:
Within returns a vector containing the values in deviation from the individual means (if effect = “individual”, from time means if effect = “time”), the so called demeaned data.
Also recall that the d_panel_DONOHUE data is just the Donohue data in panel format.
Now we will fit the demeaned data to the model:
Now we are ready to use the vif() function of the car package to calculate the VIF values:
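A sketch of the fit-then-VIF step, assuming demeaned_df holds the demeaned covariates plus the demeaned outcome column:

```r
library(car)

# regress the demeaned outcome on all demeaned covariates with lm()
demeaned_fit <- lm(Viol_crime_rate_1k_log ~ ., data = demeaned_df)

# VIF value for each covariate
vif(demeaned_fit)
```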
RTC_LAWTRUE White_Male_15_to_19_years White_Male_20_to_39_years
1.097853 1.172339 1.738459
Black_Male_15_to_19_years Black_Male_20_to_39_years Other_Male_15_to_19_years
1.344193 1.653712 1.586648
Other_Male_20_to_39_years Unemployment_rate Poverty_rate
1.529688 1.244667 1.270321
Population_log police_per_100k_lag
1.153933 1.204491
Now we will use the as_tibble() function of the tibble package to nicely put this together.
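One way to assemble the tibble from the named vector that vif() returns (vif_values is a hypothetical name for that vector):

```r
library(tibble)

# VIF column from the vector values, Variable column from its names
VIF_DONOHUE <- tibble(VIF = unname(vif_values), Variable = names(vif_values))
```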
# A tibble: 11 x 2
VIF Variable
<dbl> <chr>
1 1.10 RTC_LAWTRUE
2 1.17 White_Male_15_to_19_years
3 1.74 White_Male_20_to_39_years
4 1.34 Black_Male_15_to_19_years
5 1.65 Black_Male_20_to_39_years
6 1.59 Other_Male_15_to_19_years
7 1.53 Other_Male_20_to_39_years
8 1.24 Unemployment_rate
9 1.27 Poverty_rate
10 1.15 Population_log
11 1.20 police_per_100k_lag
Now we will do the same for the Lott and Mustard data.
We will need to use the rename() function (you may recall this is part of the dplyr package) to rename RTC_LAWTRUE back to RTC_LAW, as the variable name in the model output is appended with TRUE because it was a logical variable, while our saved formula uses RTC_LAW. Because the model formula for the Lott analysis is so complex, it is easier to change this variable name in our new data frame than to rewrite the formula.
Recall that we already saved the formula for this data:
Viol_crime_rate_1k_log ~ RTC_LAW + White_Female_10_to_19_years +
White_Female_20_to_29_years + White_Female_30_to_39_years +
White_Female_40_to_49_years + White_Female_50_to_64_years +
White_Female_65_years_and_over + White_Male_10_to_19_years +
White_Male_20_to_29_years + White_Male_30_to_39_years + White_Male_40_to_49_years +
White_Male_50_to_64_years + White_Male_65_years_and_over +
Black_Female_10_to_19_years + Black_Female_20_to_29_years +
Black_Female_30_to_39_years + Black_Female_40_to_49_years +
Black_Female_50_to_64_years + Black_Female_65_years_and_over +
Black_Male_10_to_19_years + Black_Male_20_to_29_years + Black_Male_30_to_39_years +
Black_Male_40_to_49_years + Black_Male_50_to_64_years + Black_Male_65_years_and_over +
Other_Female_10_to_19_years + Other_Female_20_to_29_years +
Other_Female_30_to_39_years + Other_Female_40_to_49_years +
Other_Female_50_to_64_years + Other_Female_65_years_and_over +
Other_Male_10_to_19_years + Other_Male_20_to_29_years + Other_Male_30_to_39_years +
Other_Male_40_to_49_years + Other_Male_50_to_64_years + Other_Male_65_years_and_over +
Unemployment_rate + Poverty_rate + Population_log + police_per_100k_lag
RTC_LAW White_Female_10_to_19_years
1.621662 127.920555
White_Female_20_to_29_years White_Female_30_to_39_years
42.269637 49.635014
White_Female_40_to_49_years White_Female_50_to_64_years
37.550101 36.451868
White_Female_65_years_and_over White_Male_10_to_19_years
12.866751 126.824984
White_Male_20_to_29_years White_Male_30_to_39_years
39.248785 73.008959
White_Male_40_to_49_years White_Male_50_to_64_years
31.613855 52.774694
White_Male_65_years_and_over Black_Female_10_to_19_years
13.285326 335.136906
Black_Female_20_to_29_years Black_Female_30_to_39_years
106.644486 79.058455
Black_Female_40_to_49_years Black_Female_50_to_64_years
98.434064 66.888057
Black_Female_65_years_and_over Black_Male_10_to_19_years
49.715869 320.740453
Black_Male_20_to_29_years Black_Male_30_to_39_years
89.297151 89.267356
Black_Male_40_to_49_years Black_Male_50_to_64_years
92.498486 64.538516
Black_Male_65_years_and_over Other_Female_10_to_19_years
37.960126 142.283700
Other_Female_20_to_29_years Other_Female_30_to_39_years
64.966861 54.511835
Other_Female_40_to_49_years Other_Female_50_to_64_years
224.567085 131.463113
Other_Female_65_years_and_over Other_Male_10_to_19_years
82.394398 151.930450
Other_Male_20_to_29_years Other_Male_30_to_39_years
54.620045 62.267344
Other_Male_40_to_49_years Other_Male_50_to_64_years
244.698473 174.184553
Other_Male_65_years_and_over Unemployment_rate
53.532299 1.497864
Poverty_rate Population_log
1.412397 3.426475
police_per_100k_lag
1.732745
# A tibble: 41 x 2
VIF Variable
<dbl> <chr>
1 1.62 RTC_LAW
2 128. White_Female_10_to_19_years
3 42.3 White_Female_20_to_29_years
4 49.6 White_Female_30_to_39_years
5 37.6 White_Female_40_to_49_years
6 36.5 White_Female_50_to_64_years
7 12.9 White_Female_65_years_and_over
8 127. White_Male_10_to_19_years
9 39.2 White_Male_20_to_29_years
10 73.0 White_Male_30_to_39_years
# … with 31 more rows
Now, to have consistent variable names across the two VIF data sets, we will rename RTC_LAW back to RTC_LAWTRUE using the str_replace() function of the stringr package. This function replaces matches of a pattern within a string.
# A tibble: 41 x 2
VIF Variable
<dbl> <chr>
1 1.62 RTC_LAWTRUE
2 128. White_Female_10_to_19_years
3 42.3 White_Female_20_to_29_years
4 49.6 White_Female_30_to_39_years
5 37.6 White_Female_40_to_49_years
6 36.5 White_Female_50_to_64_years
7 12.9 White_Female_65_years_and_over
8 127. White_Male_10_to_19_years
9 39.2 White_Male_20_to_29_years
10 73.0 White_Male_30_to_39_years
# … with 31 more rows
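The str_replace() step described above can be sketched as follows, with a toy two-row version of the Lott and Mustard VIF tibble:

```r
library(dplyr)
library(stringr)

# Toy two-row stand-in for the Lott and Mustard VIF tibble
lott_vif <- tibble::tibble(
  VIF = c(1.62, 127.9),
  Variable = c("RTC_LAW", "White_Female_10_to_19_years")
)

# str_replace() swaps the first match of the pattern for the replacement text;
# rows without a match are left unchanged
lott_vif <- mutate(lott_vif, Variable = str_replace(Variable, "RTC_LAW", "RTC_LAWTRUE"))
lott_vif$Variable
```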
We can see that some of the VIF values are very high!
Now we will make a plot of the VIF values for both analyses. We will add text at a specific location on the plot using the geom_text() function. A threshold of 10 is typically used to flag VIF values as problematically high.
You can also search the above tables of results for “RTC” to see how the VIF values for the RTC variable differ between the two analyses. They are close to one another, although the value is slightly higher for Analysis 2.
Next, we will make a couple of plots to illustrate how the VIF values compare between the two models.
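A sketch of such a plot is shown below, with a small made-up VIF table standing in for the combined results of the two analyses:

```r
library(ggplot2)

# Made-up combined VIF table; `Analysis` labels which model each value came from
vif_both <- data.frame(
  Variable = rep(c("RTC_LAWTRUE", "Poverty_rate", "Black_Male_20_to_39_years"), times = 2),
  VIF = c(1.10, 1.27, 1.65, 1.62, 1.41, 89.3),
  Analysis = rep(c("Analysis 1", "Analysis 2"), each = 3)
)

p <- ggplot(vif_both, aes(x = Variable, y = VIF, color = Analysis)) +
  geom_point() +
  # mark the conventional threshold of 10
  geom_hline(yintercept = 10, linetype = "dashed") +
  # geom_text() can place a label at a specific (x, y) location on the plot
  geom_text(x = 1.5, y = 20, label = "VIF threshold = 10", color = "black")
p
```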

We can see that Analysis 2 has variables with much higher VIF values, indicating substantially greater multicollinearity.
Let’s make this plot a little easier to read using the coord_trans() function of the ggplot2 package with the y = "log10" argument. This does not change the values; it only adjusts how the y axis is displayed, with diminishing distance between grid lines.
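A self-contained illustration of this display-only transformation, using toy data spanning several orders of magnitude:

```r
library(ggplot2)

# Toy data spanning several orders of magnitude
d <- data.frame(x = 1:4, y = c(1, 10, 100, 1000))

p <- ggplot(d, aes(x, y)) + geom_point()

# coord_trans() transforms the coordinate display only; the underlying
# data values (and any statistics already computed) are unchanged
p_log <- p + coord_trans(y = "log10")
p_log
```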

In many cases it would be advisable to remove one or more of these variables and reassess the VIF values. There are also other options, such as ridge regression. However, both of these approaches need to be applied with care, as they can also introduce bias into the model.
In any case, the presence of multicollinearity should prompt further investigation of the model’s design, as the results may not be reliable given the increased instability of the coefficient estimates.
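To see why, recall that the VIF for a regressor is 1 / (1 − R²), where R² comes from regressing that variable on the other regressors. A small base-R demonstration with two nearly collinear predictors (the data here are simulated for illustration):

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # x2 is nearly a copy of x1
y  <- 2 * x1 + rnorm(n)

# VIF for x1: 1 / (1 - R^2) from regressing x1 on the other predictor(s)
r2     <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1  # far above the usual threshold of 10

# The standard errors for the x1 and x2 coefficients are inflated accordingly
summary(lm(y ~ x1 + x2))$coefficients
```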
See this article for a detailed discussion of what to consider when your model has variables with high VIF values.